Ford GoBike Data Exploration

by Mostafa Hasan Mahmoud

Preliminary Wrangling

This document explores a dataset of individual rides made in a bike-sharing system covering the greater San Francisco Bay Area. The data spans 28 days, from 2019-02-01 to 2019-02-28, and contains 183412 rows and 16 columns before cleaning.
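A minimal sketch of how the row count, column count, and date range quoted above could be checked with pandas. The three-row frame is a made-up stand-in for the real data, and the file name in the comment is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the real file; in practice the data would be
# loaded with something like pd.read_csv("201902-fordgobike-tripdata.csv")
# (file name assumed, not taken from this document).
rides = pd.DataFrame({
    "duration_sec": [523, 1804, 610],
    "start_time": ["2019-02-01 00:05:12", "2019-02-14 08:30:00", "2019-02-28 23:41:07"],
    "end_time": ["2019-02-01 00:13:55", "2019-02-14 09:00:04", "2019-02-28 23:51:17"],
})

# Parse the timestamps so that min/max and date arithmetic work.
rides["start_time"] = pd.to_datetime(rides["start_time"])
rides["end_time"] = pd.to_datetime(rides["end_time"])

n_rows, n_cols = rides.shape
first_day = rides["start_time"].min().date()
last_day = rides["start_time"].max().date()
n_days = (last_day - first_day).days + 1  # inclusive count of calendar days

print(n_rows, n_cols, first_day, last_day, n_days)
```

On the full dataset the same calls should report the shape and the 28-day span described above.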

Data Cleaning

The initial inspection found that:

  1. There are 329 stations in total.
  2. There are 4607 bikes in total.
  3. User type has 2 categories.
  4. Member gender has 3 categories.
  5. Member age has 75 unique values.
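Counts like the ones listed above can be obtained with `DataFrame.nunique`. The sketch below uses a small invented sample, so its numbers are illustrative only and do not reproduce the real totals.

```python
import pandas as pd

# Invented sample rows; column names follow the dataset's feature list.
sample = pd.DataFrame({
    "start_station_id": [21, 21, 67, 3],
    "bike_id": [4902, 2535, 4902, 6638],
    "user_type": ["Customer", "Subscriber", "Subscriber", "Subscriber"],
    "member_gender": ["Male", "Female", "Other", "Male"],
    "member_age_2019": [35, 47, 35, 29],
})

# nunique() counts distinct values per column and ignores missing values.
unique_counts = sample.nunique()
print(unique_counts)
```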

What is the structure of your dataset?

After cleaning, the dataset has 174952 rows and 18 features.

The 18 feature columns are:

['duration_sec', 'start_time', 'end_time', 'start_station_id', 'start_station_name', 'end_station_id', 'end_station_name', 'bike_id', 'user_type', 'member_gender', 'bike_share_for_all_trip', 'stday_name', 'stday_num', 'sthour', 'enday_name', 'enday_num', 'enhour', 'member_age_2019'].
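The derived columns in this list (day name, day number, hour, and age) are not described further, so the following is only a sketch of one way they could be built. It assumes that `stday_num` and `enday_num` hold the day of the month and that the raw data has a `member_birth_year` column; both are assumptions.

```python
import pandas as pd

# Invented rows; member_birth_year is an assumed raw column.
rides = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-02-01 17:32:10", "2019-02-03 09:05:00"]),
    "end_time": pd.to_datetime(["2019-02-01 17:45:00", "2019-02-03 09:40:30"]),
    "member_birth_year": [1984, 1995],
})

# Start-of-ride features.
rides["stday_name"] = rides["start_time"].dt.day_name()
rides["stday_num"] = rides["start_time"].dt.day   # assumed: day of month
rides["sthour"] = rides["start_time"].dt.hour

# End-of-ride features, built the same way.
rides["enday_name"] = rides["end_time"].dt.day_name()
rides["enday_num"] = rides["end_time"].dt.day     # assumed: day of month
rides["enhour"] = rides["end_time"].dt.hour

# Age in 2019, assuming it is derived from the birth year.
rides["member_age_2019"] = 2019 - rides["member_birth_year"]

print(rides[["stday_name", "stday_num", "sthour", "member_age_2019"]])
```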

Most variables are numeric. The exceptions are the nominal variables (start_station_name, end_station_name), the categorical variables (user_type, member_gender, bike_share_for_all_trip, stday_name, enday_name), and the datetime variables (start_time, end_time).
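To make these variable types explicit in pandas, the categorical columns can be cast to the `category` dtype, and the day names can be given a weekday order so that sorting and plotting follow the calendar. This is a sketch on invented rows; the Monday-first order is my assumption.

```python
import pandas as pd

rides = pd.DataFrame({
    "user_type": ["Subscriber", "Customer", "Subscriber"],
    "stday_name": ["Monday", "Sunday", "Friday"],
})

# Assumed ordering: the week starts on Monday.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Unordered categorical for user type.
rides["user_type"] = rides["user_type"].astype("category")

# Ordered categorical so that sorting follows the weekday order,
# not alphabetical order.
rides["stday_name"] = pd.Categorical(
    rides["stday_name"], categories=day_order, ordered=True
)

sorted_days = rides["stday_name"].sort_values().tolist()
print(sorted_days)
```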

What is/are the main feature(s) of interest in your dataset?

I am most interested in which features best help answer the following questions:

  1. What are the most used stations?
  2. When are the peak hours?
  3. When are most rides made in terms of the time of day and day of the week?
  4. How long does the average ride take?
  5. Does the duration of the ride depend on whether the user is a subscriber or a customer?
  6. Who makes the most rides in terms of gender?
  7. Do all bikes make the same number of rides?
  8. Does the age of the user affect the ride duration?
  9. Does the age of the user affect the user type?
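Several of these questions reduce to simple counts and group means. The sketch below shows the kind of pandas calls that could answer questions 1, 2, 4, and 5; the station names and values are invented for illustration.

```python
import pandas as pd

# Invented sample; "Station A" and "Station B" are placeholder names.
rides = pd.DataFrame({
    "start_station_name": ["Station A", "Station A", "Station B", "Station A", "Station B"],
    "sthour": [8, 17, 8, 8, 12],
    "duration_sec": [300, 900, 600, 450, 750],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Subscriber", "Customer"],
})

# Question 1: rides per start station, most used first.
top_stations = rides["start_station_name"].value_counts()

# Question 2: the hour with the most ride starts.
peak_hour = rides["sthour"].value_counts().idxmax()

# Question 4: average ride length in minutes.
mean_duration_min = rides["duration_sec"].mean() / 60

# Question 5: average duration per user type.
duration_by_type = rides.groupby("user_type")["duration_sec"].mean()

print(top_stations.index[0], peak_hour, mean_duration_min)
print(duration_by_type)
```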

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect that:

['duration_sec', 'start_station_name', 'end_station_name', 'bike_id', 'user_type', 'member_gender', 'bike_share_for_all_trip', 'stday_name', 'stday_num', 'sthour', 'member_age_2019']

will have the strongest effect on my investigation.

Univariate Exploration

I'll start by looking at the distributions of the main variables of interest:

['duration_sec', 'start_station_name', 'end_station_name', 'bike_id', 'user_type', 'member_gender', 'bike_share_for_all_trip', 'stday_name', 'stday_num', 'sthour', 'member_age_2019']
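For a variable such as duration_sec, whose values can range from about a minute to many hours, one common option is to bin on a logarithmic scale. The sketch below does this with NumPy on invented durations; the bin edges are my choice, not something taken from the analysis.

```python
import numpy as np
import pandas as pd

# Invented durations in seconds.
durations = pd.Series([61, 120, 300, 450, 600, 900, 1800, 3600, 84000],
                      name="duration_sec")

# Bin edges spaced evenly in log10: 10**1.5, 10**2, ..., 10**5.
bin_edges = 10 ** np.arange(1.5, 5.0 + 0.5, 0.5)

# Count how many rides fall into each log-spaced bin.
counts, edges = np.histogram(durations, bins=bin_edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:9.0f} to {right:9.0f} s: {n}")
```

The same edges can be passed to a plotting library's histogram function together with a log-scaled x axis.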

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Needs more investigation

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bivariate Exploration

To start off with, I want to look at the pairwise correlations present between features in the data.
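As a sketch of that first step, `DataFrame.corr` returns the pairwise Pearson correlations between numeric columns. The values below are invented and deliberately constructed so that duration and age are perfectly negatively correlated; they say nothing about the real data.

```python
import pandas as pd

# Invented numeric features.
rides = pd.DataFrame({
    "duration_sec": [300, 600, 900, 1200],
    "sthour": [8, 9, 17, 18],
    "member_age_2019": [40, 35, 30, 25],
})

# Pairwise Pearson correlation matrix (the default method).
corr = rides.corr()
print(corr.round(2))
```

The resulting matrix can be passed to a heatmap function for a visual overview.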

Observed

- There doesn't seem to be much interaction between the categorical variables above, though proportionally there appear to be more rides by males, by subscribers, and by users whose bike_share_for_all_trip value is No.
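Proportions like these can be read from a normalized cross-tabulation. The sketch below uses invented rows, so the shares it prints are illustrative only.

```python
import pandas as pd

# Invented sample of two categorical features.
rides = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer"],
    "member_gender": ["Male", "Male", "Female", "Male"],
})

# Share of each gender within each user type (every row sums to 1).
shares = pd.crosstab(rides["user_type"], rides["member_gender"],
                     normalize="index")
print(shares)
```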

With the preliminary look at bivariate relationships out of the way, I want to dig into some of the relationships more. First, I want to see how 'bike_id' and ('stday_num', 'enday_num') are related to one another for all of the data.
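One way to relate bike_id to the day number is to count rides per bike per day, which also shows whether all bikes are used equally often. This is a sketch on invented rows, again assuming stday_num is the day of the month.

```python
import pandas as pd

# Invented sample: three bikes over two days.
rides = pd.DataFrame({
    "bike_id": [1, 1, 2, 1, 2, 3],
    "stday_num": [1, 1, 1, 2, 2, 2],
})

# Rides per bike and day; bikes with no ride on a day get 0.
rides_per_bike_day = (
    rides.groupby(["bike_id", "stday_num"]).size().unstack(fill_value=0)
)

# Total rides per bike across all days.
rides_per_bike = rides["bike_id"].value_counts()

print(rides_per_bike_day)
print(rides_per_bike)
```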

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Multivariate Exploration

The main thing I want to explore in this part of the analysis is how the categorical variables of 'user_type', 'member_gender', 'bike_share_for_all_trip', 'stday_name' play into the relationship between 'duration_sec', 'stday_num', and 'member_age_2019'.
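A compact way to look at two categorical variables against a numeric one is a pivot table of group means. The sketch below uses invented rows and mean duration as the summary; other combinations of the variables named above would follow the same pattern.

```python
import pandas as pd

# Invented sample.
rides = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber",
                  "Customer", "Customer", "Customer"],
    "member_gender": ["Male", "Male", "Female", "Male", "Female", "Female"],
    "duration_sec": [400, 600, 500, 900, 1000, 1200],
})

# Mean ride duration for every user type and gender combination.
summary = rides.pivot_table(index="user_type", columns="member_gender",
                            values="duration_sec", aggfunc="mean")
print(summary)
```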

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?